ABSTRACT

Existing methods for orthologous gene mapping 
suffer from two general problems: (i) they are com- 
putationally too slow and their results are difficult to 
interpret for automated large-scale applications 
when based on phylogenetic analyses; or (ii) they 
are too prone to making mistakes in dealing with 
complex situations involving horizontal gene trans- 
fers and gene fusion due to the lack of a sound basis 
when based on sequence similarity information. 
We present a novel algorithm, Global Optimization 
Strategy (GOST), for orthologous gene mapping 
through combining sequence similarity and context- 
ual (working partners) information, using a combina- 
torial optimization framework. Genome-scale 
applications of GOST show substantial improve- 
ments over the predictions by three popular 
sequence similarity-based orthology mapping 
programs. Our analysis indicates that our algorithm 
overcomes the intrinsic issues faced by sequence 
similarity-based methods, when orthology mapping 
involves gene fusions and horizontal gene transfers. 
Our program runs as efficiently as the most efficient 
sequence similarity-based algorithm in the public 
domain. GOST is freely downloadable at
http://csbl.bmb.uga.edu/~maqin/GOST.

INTRODUCTION 

Orthologous genes are genes that have evolved from a
common ancestor through speciation only (1). A widely
accepted corollary, especially for bacterial genomes, is that
they are functional equivalents, i.e. they play the same
functional role in the equivalent biological processes
across different organisms. Identification of orthologous
genes across genomes, or 'orthologous gene mapping',
represents the most essential technique in comparative
genomics, but the problem remains largely unsolved.
One key issue is that the definition of orthologous genes
is not operational unless phylogenetic trees can be
accurately derived and analysis methods are available
for distinguishing orthologous from paralogous genes;
both are themselves very challenging and unsolved
problems. Because of the nature of the problem, the
majority of the computer programs developed for
solving it have been empirical in nature and often lack
a sound theoretical basis.

The current orthology-mapping programs generally fall 
into two categories, phylogeny-based and sequence 
similarity-based (2). In the first category, gene trees need 
to be constructed, followed by rather involved analyses of 
the constructed trees to derive orthologous gene relation- 
ships. Programs such as RIO (3), Orthostrapper (4), RSD 
(5), Mestortho (6), OMA (7) and QuartetS (8) fall into this 
category. The best example of sequence similarity-based 
methods is the Reciprocal Best Hit (RBH) program (9), 
which predicts the orthologous gene in a target genome 
for a given gene A in a query genome by finding a gene B 
in the target genome so that B is the best Blast hit in the 
target genome for A, and vice versa. Clusters of
Orthologous Groups (COG)/eukaryotic Orthologous
Groups (KOG) (10) 'generalize' RBH by considering
three genomes instead of two, aiming to increase the
prediction accuracy of RBH, but suffer from low
prediction coverage. In addition, there are several more
recent developments aimed at further improving the
prediction reliability, including INPARANOID (11) and
OrthoMCL (12). However, systematic analyses indicate
that RBH is still the one to beat in terms of prediction
accuracy among all sequence similarity-based methods as
long as two specific parameters, soft filtering and
Smith-Waterman final alignment, are set properly (13).
In addition, RBH is the most time-efficient compared to
other orthology mapping methods (7).
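
The RBH criterion described above can be sketched in a few lines. This is a minimal illustration, not the RBH program of (9): the function names and the toy score dictionaries are hypothetical, and the scores stand in for any similarity measure where higher is better (for example a BLAST bit score).

```python
def best_hits(scores):
    """Map each source gene to its highest-scoring gene in the other genome.

    `scores` maps (source_gene, other_gene) -> similarity score; higher is
    better. Ties are resolved in favour of the first pair encountered.
    """
    best = {}
    for (src, dst), score in scores.items():
        if src not in best or score > best[src][1]:
            best[src] = (dst, score)
    return {src: dst for src, (dst, _) in best.items()}


def reciprocal_best_hits(forward, backward):
    """Return sorted pairs (a, b) where b is a's best hit and a is b's best hit.

    `forward` holds query-to-target scores, `backward` target-to-query scores.
    """
    fwd = best_hits(forward)
    bwd = best_hits(backward)
    return sorted((a, b) for a, b in fwd.items() if bwd.get(b) == a)


# Hypothetical toy data: "a1" and "b1" are mutual best hits, while "a2"
# also has "b1" as its best hit but is not the best hit of "b1".
forward = {("a1", "b1"): 200.0, ("a1", "b2"): 50.0, ("a2", "b1"): 120.0}
backward = {("b1", "a1"): 200.0, ("b1", "a2"): 120.0, ("b2", "a1"): 50.0}
pairs = reciprocal_best_hits(forward, backward)
print(pairs)
```

In this toy case only ("a1", "b1") is reported; "a2" receives no ortholog, which is the kind of low coverage attributed to RBH in the text.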

Phylogeny-based methods are generally more reliable 
than sequence similarity-based methods, while the latter 
are generally orders of magnitude more efficient, and 
hence are capable of dealing with genome-scale applica- 
tions. A recent survey (7) suggested that the substantially 
more time needed for phylogeny-based orthologous gene 
prediction may not be worthwhile for the limited increase 
in prediction accuracy over sequence similarity-based 
methods. It is fair to say that RBH represents the state 
of the art in orthologous gene prediction for large-scale 
applications. In their recent review, Chen et al. (2) 
reported that RBH tends to have high prediction 
accuracy (i.e. low false positive) but suffers from low pre- 
diction coverage (i.e. high false negatives) while programs 
like KOG and OrthoMCL tend to have the opposite 
behavior. One general issue with all these sequence 
similarity-based methods is that they all implicitly 
assume that orthologous relationship can be captured by 
sequence similarity information alone among homologous 
genes, which is clearly not true (14). There is no real
biological reason why two orthologous genes should have
the best sequence alignment score among possible
homologous alternatives, particularly when the similarity
scores are close.

We here present a novel program Global Optimization 
Strategy (GOST) for orthologous gene mapping across 
bacterial genomes, which to a large extent overcomes the 
intrinsic issues faced by sequence similarity-based methods 
discussed above. The fundamental difference between 
GOST and the existing sequence similarity-based 
methods is that GOST is designed to find orthologous 
gene pairs across two genomes with a good 'enough' 
sequence similarity score under the condition that the 
two genes have homologous working partners in their
respective genomes (throughout the article two genes are
said to be homologous if the BLAST E-value of their
alignment is below a specific threshold). Here two
genes in a genome are considered 'working partners' if
they share a common 'uber-operon' (15), which
generalizes our previous work where we defined such a
relationship based on operons (14). We demonstrated the
effectiveness of this strategy on a large set of bacterial
genomes by showing that GOST outperforms three
popular sequence similarity-based orthology mapping
programs, RBH, INPARANOID and OrthoMCL, by
substantial margins in terms of prediction 'coverage',
'mislabeling error rate' and 'missing error rate', which
are commonly used to assess orthologous gene prediction
programs (13). We further compared GOST with RBH,
INPARANOID and OrthoMCL in their predictions
when mapping Escherichia coli enzyme-encoding genes
to all sequenced bacterial genomes against known
orthologous relationships as documented in the
SwissProt Enzyme database (16). Specifically, GOST
identified 665 more enzyme gene pairs than RBH, 1901
more than INPARANOID and 2354 more than
OrthoMCL. We believe that the performance of GOST
is actually better than what these numbers suggest, as the
Enzyme database contains only a small portion of all the
orthology relationships among enzyme-encoding genes
across these bacterial genomes. Overall, GOST is much
more efficient than OrthoMCL and INPARANOID,
and is as fast as RBH (see Supplementary Table S1 in
the Supplementary Data).

MATERIALS AND METHODS 

Data 

We used E. coli K12 as the query genome and 959 other
complete bacterial genomes from NCBI (release of April
2009) as the targets for orthologous gene mapping. The
operon information was downloaded from the DOOR
database (17) on 1 November 2009. The SwissProt
database was downloaded from
http://www.expasy.org/enzyme/ on 5 October 2010.

Orthologous gene mapping: problem formulation 

We first introduce a few graph-theoretic definitions needed 
for our problem formulation and solution. A graph is 
called a 'multi-graph' if more than one edge is allowed 
between two vertices in the graph. A 'bipartite' graph 
B = (X, Y, E) is a graph whose vertex set can be parti- 
tioned into two subsets X and Y so that no edge exists 
between vertices of the same subset. A 'matching' M is a 
subset of E such that no two edges share a common vertex. 
A 'maximum matching' is a matching of the maximum 
cardinality. A graph is said to be 'connected' if for any 
pair of vertices, there is a path between the two vertices 
within the graph. A 'component' in a graph is a maximum 
connected sub-graph. 
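
As an illustration of these definitions, a maximum matching in a bipartite graph can be found with the classical augmenting-path method (often called Kuhn's algorithm). The sketch below is a generic textbook routine, not code from GOST; the adjacency-dictionary input format and the vertex names are assumptions made for the example.

```python
def maximum_matching(adj):
    """Maximum matching in a bipartite graph via augmenting paths.

    `adj` maps each vertex of X to a list of its neighbours in Y.
    Returns a dict mapping each matched Y vertex to its X partner.
    """
    match_of_y = {}

    def try_augment(x, seen):
        for y in adj.get(x, []):
            if y in seen:
                continue
            seen.add(y)
            # Use y if it is free, or if its current partner can be
            # re-matched to some other Y vertex.
            if y not in match_of_y or try_augment(match_of_y[y], seen):
                match_of_y[y] = x
                return True
        return False

    for x in adj:
        try_augment(x, set())
    return match_of_y


# Hypothetical toy graph with X = {x1, x2, x3} and Y = {y1, y2}.
adj = {"x1": ["y1", "y2"], "x2": ["y1"], "x3": ["y2"]}
matching = maximum_matching(adj)
print(len(matching))
```

Because Y has only two vertices, no matching here can contain more than two edges, so the matching returned for this toy graph is maximum.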

Let $G_1$ and $G_2$ be the gene sets of two given bacterial
genomes. Define a bipartite graph $B = (G_1, G_2, E)$, termed
a 'homology' graph, where two vertices (one from each
genome) are connected by an edge in $E$ if and only if the
BLAST E-value between the corresponding genes is below a
pre-defined threshold (see Method 1 in the Supplementary
Data). One possible way to formulate the orthologous
gene mapping problem is through finding a maximum
matching in the bipartite graph $B$ [we noticed that an
article was just published using this formulation for the
orthologous gene mapping problem (18) as we were
writing this article]. One way to include biological
process information in the problem formulation is
through the use of operons, as genes in the same
operon generally work in the same biological process.
Specifically, we can constrain the bipartite matching
problem by requiring that each gene pair in a matching
has at least one additional pair of homologous genes
sharing their operons. However, our previous study
showed that this constraint led to rather low mapping
coverage. Hence we looked into a generalized form of
operons, i.e. uber-operons (15). An 'uber-operon' is a
group of operons in a genome whose operons are
functionally related and whose union is conserved across
multiple (reference) genomes more frequently than
expected by chance [we refer the reader to (15) for
details of the definition]. We believe that uber-operons
are the evolutionary footprints of ancient operons,
which were split in different ways into smaller operons
along different evolutionary lineages (15). In this article
we formulate the orthologous gene mapping problem
by (implicitly) incorporating the uber-operon information
into a sequence similarity-based procedure.
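
The construction of the homology graph can be sketched as follows. This is only an illustration under stated assumptions: the hit records are a hypothetical list of (query gene, target gene, E-value) triples, and the cutoff of 1e-5 is a placeholder, since the actual threshold is specified in Method 1 of the Supplementary Data.

```python
from collections import defaultdict


def homology_graph(hits, evalue_cutoff=1e-5):
    """Build the bipartite homology graph as an adjacency dictionary.

    `hits` is an iterable of (query_gene, target_gene, evalue) triples.
    An edge is kept only if its E-value is below `evalue_cutoff`
    (a placeholder value, not the threshold used in the study).
    """
    adj = defaultdict(set)
    for query, target, evalue in hits:
        if evalue < evalue_cutoff:
            adj[query].add(target)
    # Sorted lists make the output deterministic and easy to inspect.
    return {query: sorted(targets) for query, targets in adj.items()}


# Hypothetical hits: the g1-t2 hit is too weak to become an edge.
hits = [("g1", "t1", 1e-50), ("g1", "t2", 1e-3), ("g2", "t2", 1e-20)]
graph = homology_graph(hits)
print(graph)
```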

Consider two genomes for orthologous gene mapping.
Define a new graph $G = (O, M)$, with vertex set $O$
consisting of the operons of the two genomes, and with the
edge set being the current matching $M$ of $B$ (not
necessarily maximum); each edge of $M$ joins the two operons
that contain its two genes. Clearly $G$ is a multi-graph since
there might be multiple edges of $M$ between two vertices
(operons) in $O$. Let $c(O, M)$ denote the number of
connected components in $G$. Based on the above discussion,
we found that a maximum matching $M$ in $B$ gives rise to an
orthologous gene mapping if and only if $M$ maximizes
$c(O, M)$. Let $\mathcal{M}^*$ be the set of all the maximum
matchings in $B$. Hence the orthologous gene mapping problem
between the two genomes is to find a maximum matching $M$ in
$\mathcal{M}^*$ such that $c(O, M)$ is maximized. Let
$\mathcal{M}$ be the set of all the matchings in $B$.
It is easy to check that the following two optimization
problems

$$\max\{c(O, M) : M \in \mathcal{M}^*\} \quad \text{and} \quad \max\{|M| + c(O, M) : M \in \mathcal{M}\}$$

have the same optimum solution. Therefore, the problem
of orthologous gene mapping can be modeled as
finding an optimum solution to the following problem:

$$\max\{|M| + c(O, M) : M \in \mathcal{M}\}$$
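
To make the objective concrete, the sketch below evaluates $|M| + c(O, M)$ for a given matching, counting components with a union-find structure. The function names, the gene-to-operon dictionary and the toy genomes are hypothetical; the code only evaluates the objective and does not optimize it.

```python
def count_components(operons, operon_of, matching):
    """c(O, M): number of connected components of the operon graph G = (O, M)."""
    parent = {o: o for o in operons}

    def find(o):
        while parent[o] != o:
            parent[o] = parent[parent[o]]  # path halving
            o = parent[o]
        return o

    components = len(operons)
    for gene1, gene2 in matching:
        a, b = find(operon_of[gene1]), find(operon_of[gene2])
        if a != b:  # this matched pair links two previously separate components
            parent[a] = b
            components -= 1
    return components


def objective(operons, operon_of, matching):
    """Value of |M| + c(O, M) for the matching M."""
    return len(matching) + count_components(operons, operon_of, matching)


# Hypothetical toy case: operon P1 = {a1, a2} in genome 1; operons
# Q1 = {b1, b2} and Q2 = {b3} in genome 2; a2 is homologous to b2 and b3.
operons = ["P1", "Q1", "Q2"]
operon_of = {"a1": "P1", "a2": "P1", "b1": "Q1", "b2": "Q1", "b3": "Q2"}
clustered = [("a1", "b1"), ("a2", "b2")]  # both pairs link P1 with Q1
scattered = [("a1", "b1"), ("a2", "b3")]  # pairs spread over Q1 and Q2
print(objective(operons, operon_of, clustered))
print(objective(operons, operon_of, scattered))
```

Both matchings have two edges, but the clustered one leaves Q2 as a separate component and therefore scores higher, which is how the objective rewards gene pairs whose operon partners are also paired.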

Although this optimization problem is theoretically in- 
tractable (it is relatively simple to prove that it is 
NP-hard), it is easy to get an optimum solution virtually 
with probability = 1.0 in this particular context because of 
the following observation: the orthologous gene pairs (or 
matched edges in B) in two conserved uber-operons (one 
from each genome) are always denser than those in two 
unrelated uber-operons. We predict each edge in the 
calculated optimum matching as a pair of orthologous 
genes across the two genomes under consideration. The 
following gives a high-level description of our algorithm. 

For simplicity of presentation, we introduce another
weighted complete graph, $G^* = (V, \mathcal{E})$, with vertex set
$V$ consisting of all the connected components of $G$ and
edge set $\mathcal{E}$ consisting of all the pairs of vertices in
$V$. For each edge $e = (C_1, C_2)$, its weight is defined as
$w(e) = |M_{12}| - |M_1| - |M_2|$, where $M_1$ and $M_2$ are the
restrictions of $M$ to $C_1$ and $C_2$, respectively, and $M_{12}$
is a maximum matching of the subgraph $C_{12}$ which is induced
in $B$ by the union of $C_1$ and $C_2$. It is obvious from the
definition of $G^*$ that $G^*$ depends on the current matching
$M$. The algorithm starts with $M$ being empty. At this
moment the graph $G$ is a graph with vertex set
$O$ consisting of all the operons from $G_1$ and $G_2$ and with
an empty edge set, and the graph $G^*$ is a weighted
complete graph with vertex set $V$ the same as $O$ and the
weight of an edge being the cardinality of a maximum
matching between the two operons connected by the edge.

Input: query genome $G_1$ and target genome $G_2$.

Output: a maximal set of orthologous gene pairs
between the two genomes.

Step 1. Construct graphs $B = (G_1, G_2, E)$, $G = (O, M)$ and
$G^* = (V, \mathcal{E})$ with $M = \emptyset$.

Step 2. Find an edge $e = (C_1, C_2)$ of $G^*$ with the largest
weight $w(e)$. If $w(e) = 0$, go to Step 4.

Step 3. Merge the two components $C_1$ and $C_2$ into one,
reset $M = M - M_1 - M_2 + M_{12}$ and modify $G$ and $G^*$
accordingly; return to Step 2.

Step 4. Output the current matching $M$, i.e. a maximal
orthologous mapping.
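
Steps 1 to 4 can be sketched as a small greedy procedure. This is a simplified reading of the description above and not the GOST implementation: it recomputes every edge weight from scratch in each round (the released program is presumably far more efficient), and the data layout (operon name mapped to a gene set, query gene mapped to its homologous target genes) and all names are assumptions made for the example.

```python
from itertools import combinations


def max_matching(genes, adj):
    """Maximum matching of the homology graph induced by `genes`."""
    partner = {}  # target gene -> matched query gene

    def augment(q, seen):
        for t in sorted(adj.get(q, ())):
            if t not in genes or t in seen:
                continue
            seen.add(t)
            if t not in partner or augment(partner[t], seen):
                partner[t] = q
                return True
        return False

    for q in sorted(genes):
        if q in adj:  # only query genes have outgoing homology edges
            augment(q, set())
    return frozenset((q, t) for t, q in partner.items())


def greedy_merge(operons, adj):
    """Greedy component merging following Steps 1-4.

    `operons` maps an operon name to its set of genes (both genomes);
    `adj` maps each query gene to its set of homologous target genes.
    """
    # Step 1: every operon is its own component, and M is empty.
    comps = {name: (frozenset(g), frozenset()) for name, g in operons.items()}
    while True:
        best, best_w = None, 0
        # Step 2: find the pair of components with the largest weight.
        for c1, c2 in combinations(sorted(comps), 2):
            genes = comps[c1][0] | comps[c2][0]
            m12 = max_matching(genes, adj)
            w = len(m12) - len(comps[c1][1]) - len(comps[c2][1])
            if w > best_w:
                best, best_w = (c1, c2, genes, m12), w
        if best is None:  # all weights are zero: go to Step 4
            break
        # Step 3: merge the two components and replace M1, M2 by M12.
        c1, c2, genes, m12 = best
        del comps[c1], comps[c2]
        comps[c1 + "+" + c2] = (genes, m12)
    # Step 4: the union of the per-component matchings is the output.
    return set().union(*(m for _, m in comps.values()))


# Hypothetical toy input: a2 is homologous to both b2 and b3.
operons = {"P1": {"a1", "a2"}, "Q1": {"b1", "b2"}, "Q2": {"b3"}}
adj = {"a1": {"b1"}, "a2": {"b2", "b3"}}
mapping = greedy_merge(operons, adj)
print(sorted(mapping))
```

In the toy input, merging P1 with Q1 has weight 2 while merging P1 with Q2 has weight 1, so a2 is paired with b2, the homolog whose operon also contains a partner of a1.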



RESULTS AND DISCUSSION 

A challenge in assessing orthologous gene mapping 
programs, particularly on large scale applications, is that 
there is no widely accepted benchmark dataset of 
orthologous genes across different genomes. We used the 
following methods to evaluate GOST and three other
programs (RBH, INPARANOID and OrthoMCL) 
across 959 bacterial genomes in terms of (i) whether the 
predicted gene pairs have their working partners being 
homologous; and (ii) whether the predicted orthologous 
enzyme-encoding genes have the same enzymatic func- 
tions according to the enzyme database, commonly used 
to assess orthology mapping programs (7,13). In addition, 
we have examined the performance of the programs on a 
selected set of challenging cases that involve horizontal 
gene transfers and gene fusions. 

Prediction assessment on genome-scale predictions 

Prediction 'coverage' is defined as the number of the pre- 
dicted orthologous gene pairs between a query and a 
target genome. Two genes in a genome are called 
'operon-based working partners' if and only if they are 
in the same operon. Two homologous genes, one in the
query and one in the target genome, are considered to be
correctly predicted if each has an operon-based working
partner and these two partners are homologous. A gene in a
target genome is called a 'supported' gene if one of its 
operon-based working partners is a homologous gene of 
some query gene. We use the following measures to assess 
the prediction programs. A 'missing error' is made for a 
gene in the query genome if this gene has homologous 
supported genes in the target genome as detected by 
BLAST and this gene itself is not predicted to be an 
ortholog of any gene in the target genome. A 'mislabeling
error' is made for a gene in the query genome if this gene
has at least two homologs x and x' in a target genome
with x predicted to be its ortholog, and a gene sharing an
operon with it has a homolog y with y and x' being in the
same operon in the target genome [our definitions of
missing error and mislabeling error are different from
the similar definitions given in (13) as we used operons in
our definitions. However, our performance assessment is
done using their definitions, which are explained in
Method 2 and Supplementary Figure S1 in the
Supplementary Data; see Supplementary Figure S2
for the detailed performance]. Together they are referred to
as 'errors'. The 'missing error rate' is defined as the ratio
between the number of 'missing errors' and the number of
'errors' plus the number of correctly predicted
orthologous gene pairs; and the 'mislabeling error rate'
is defined similarly. Supplementary Table S2 lists the
prediction coverage, missing error rate and mislabeling error
rate for the four programs across 959 bacterial genomes. 
The details about comparisons of the distributions of 
coverage, missing error rate and mislabeling error 
rate against genomic similarity score (GSS) (13) are in 
Figure 1. 
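
The two rates defined above reduce to simple ratios over three counts. The sketch below assumes the counts of missing errors, mislabeling errors and correctly predicted pairs have already been obtained for one query/target genome pair; the function name and the example counts are hypothetical and are not results from this study.

```python
def error_rates(n_missing, n_mislabeled, n_correct):
    """Return (missing error rate, mislabeling error rate).

    Both rates share the same denominator: all errors plus the
    correctly predicted orthologous gene pairs.
    """
    total = n_missing + n_mislabeled + n_correct
    if total == 0:
        return 0.0, 0.0  # nothing to assess for this genome pair
    return n_missing / total, n_mislabeled / total


# Hypothetical counts for one query/target genome pair.
missing_rate, mislabeling_rate = error_rates(30, 10, 960)
print(missing_rate, mislabeling_rate)
```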

By comparing the detailed prediction results given in 
Figure 1, we noted that GOST consistently outperforms 
the three other programs across the above three measures, 
especially when GSS is low, i.e. in (0.2-0.5), which
accounts for 85% of the 959 target genomes. On both
types of errors, GOST outperforms the other three algo- 
rithms by a significant margin although all four programs 
have low mislabeling error rates. 

Prediction performance on enzyme-encoding genes 

While assessing the performance of orthologous gene 
mapping programs remains an unsettled problem due to 
the reality that there is no widely accepted benchmark 
dataset, performance assessment on a special class of 
genes, i.e. enzyme-encoding genes, can be readily made. 
Specifically, we consider orthologous enzymes in the 
SwissProt database (16) as true orthologs, which can be 
used to assess the performance of the four prediction 
programs on enzyme-encoding genes. We noted that 
some E. coli enzymes have gene assignments in only a
few genomes, which could lead to incorrect conclusions
about the performance of the four programs under
testing if such enzymes were included when assessing
performance statistics. Hence we consider only enzymes
with gene assignments in at least 5% of the 959 target
genomes, i.e. 48 genomes, which leaves 419 E. coli
enzyme-encoding genes for orthology mapping. In the
remainder of this section, an E. coli gene refers to one of
these 419 genes unless stated otherwise. Note that our
evaluation method has limitations, given that (i) some of
the expert-curated enzyme annotations could possibly
contain errors; and (ii) the SwissProt Enzyme database
contains only a small portion of all the orthology
relationships among enzyme-encoding genes under
consideration. So the reader should take caution in
interpreting the comparison results.

Figure 1. A comparison of distributions of (a) prediction coverage, (b) missing error rates and (c) mislabeling error rates by RBH, INPARANOID,
OrthoMCL and GOST against GSS. Since RBH and GOST do not consider co-orthologs, we count each group of predicted co-orthologous genes as
one for INPARANOID and OrthoMCL.

Figure 2. Performance comparison between GOST and the other three programs. The x-axis represents the difference between the number of
mapped enzyme genes by GOST and one of the other three programs, grouped into bins, where [X, Y] represents enzymes over which GOST
predicted between X and Y more orthologous genes across all the target genomes. The height along the y-axis of a bar represents the number of enzymes within
each range. $N_g(X)$ represents the number of orthologous genes of E. coli genes predicted by program X for each target genome g.

For each E. coli K12 gene e, let $N_e(X)$ be the number of
genes predicted to be an ortholog of e across all bacterial
genomes under consideration by program X. For each
target genome g, let $N_g(X)$ be the number of orthologous
genes of E. coli genes predicted by X. Programs A and B
are considered to have the same level of performance over
an enzyme e if $|N_e(A) - N_e(B)| < K$, and the same level of
performance over a genome g if $|N_g(A) - N_g(B)| < K$ for
some positive number K (in our current study, K = 5). A is
said to perform better than program B over enzyme e
(respectively genome g) if $N_e(A) - N_e(B) > K$ (respectively
$N_g(A) - N_g(B) > K$). The frequency distributions of
$N_e(\mathrm{GOST}) - N_e(\mathrm{RBH})$,
$N_e(\mathrm{GOST}) - N_e(\mathrm{INPARANOID})$ and
$N_e(\mathrm{GOST}) - N_e(\mathrm{OrthoMCL})$, binned into specific
ranges, say between 5 and 10, are given in Supplementary
Figure S3. We can see from the figure that GOST 
performs at least as well as the three other programs for 
the majority of the 419 enzymes (385 enzymes for RBH, 
358 enzymes for INPARANOID and 347 enzymes for 
OrthoMCL). Interestingly, GOST performs substantially
better than the other programs over a few enzymes, e.g. it
predicts 172 more orthologous genes for EC 2.7.2.8
(N-acetylglutamate kinase) and 99 more genes for EC
2.7.13.3 (histidine kinase) than RBH. Further inspection
indicates that gene fusions or horizontal gene transfers are
the main reasons, as they have affected the performance of
the other three programs substantially more than that of
GOST, as shown in the analysis below.
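
The comparison rule with threshold K can be written as a small classifier. The sketch follows the definitions as stated, which leave a difference of exactly K unassigned; rather than guess, the code reports that case separately. The function name, the labels and the example counts are hypothetical.

```python
from collections import Counter


def compare(n_a, n_b, k=5):
    """Classify programs A and B on one enzyme (or genome) from their counts."""
    diff = n_a - n_b
    if abs(diff) < k:
        return "same"
    if diff > k:
        return "A better"
    if diff < -k:
        return "B better"
    return "unassigned"  # |diff| == k is not covered by the stated definitions


# Hypothetical counts per enzyme: (N_e(A), N_e(B)).
counts = {
    "enzyme_1": (120, 118),
    "enzyme_2": (300, 128),
    "enzyme_3": (40, 52),
    "enzyme_4": (15, 10),
}
tally = Counter(compare(a, b) for a, b in counts.values())
print(dict(tally))
```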

Figure 2 shows a genome-centric perspective of the
predictions by the four programs. Specifically, GOST
performs at the same level as the three other
programs for most organisms (945 genomes for RBH,
879 genomes for INPARANOID and 877 for
OrthoMCL); it outperforms RBH on 13 genomes,
INPARANOID on 77 genomes and OrthoMCL on 79
genomes, while GOST was outperformed by RBH on
no genome, by INPARANOID on two genomes and
by OrthoMCL on two genomes.

In the following analyses, we compare only GOST
and RBH, since Figure 2 shows that RBH is the best
among the three programs that we compare against.

GOST has a higher sensitivity than RBH 

Overall GOST and RBH predicted 1 263 642 (circle A in
Figure 3a) and 1 165 246 orthologous gene pairs (circle B),
respectively, from the 4124 E. coli genes to the 959 target
genomes. 1 113 013 of these gene pairs are predicted by
both programs, shown as the intersection C of A and B
in Figure 3a. GOST has X = 150 629 unique predictions
and RBH has Y = 52 233 unique ones; overall GOST predicts
~100 000 more orthologous gene pairs than RBH.
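
The counts above are related by simple set arithmetic, which the following lines check. All numbers are taken from this section (the sizes of the subsets X1 and X2 are the ones reported later in the section); only the variable names are introduced here.

```python
gost_total = 1_263_642   # circle A: pairs predicted by GOST
rbh_total = 1_165_246    # circle B: pairs predicted by RBH
shared = 1_113_013       # intersection C: pairs predicted by both

gost_unique = gost_total - shared    # X
rbh_unique = rbh_total - shared      # Y
net_gain = gost_total - rbh_total    # extra pairs predicted by GOST

x1, x2 = 125_831, 24_798             # subsets of X reported in this section

print(gost_unique, rbh_unique, net_gain, x1 + x2)
```

The unique counts come out as 150 629 and 52 233, the net gain is 98 396 (the "~100 000" quoted above), and X1 and X2 together account for all of X.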

X consists of two types of GOST predictions for each
target genome: X1 and X2, involving E. coli genes not
covered and covered by RBH for the current target
genome, respectively. Y1 and Y2 are defined similarly
for RBH (more details can be found in Supplementary
Figure S4 in the Supplementary Data). Figure 3b shows
an example of X1, with the (g, t) gene pair predicted by
GOST while z is the best BLAST hit of t. Here GOST's
prediction is supported by the information that there is a
second homologous gene pair (g', t'), which shares the two
operons with (g, t) in the two genomes. RBH fails to
identify an orthologous gene of g as there exist no
bi-directional best hits in the target genome. Overall,
there are 125 831 such cases missed by RBH.

Figure 3. (a) Comparison of orthologous predictions between GOST
(circle A) and RBH (circle B). X represents unique predictions by
GOST and consists of two non-overlapping subsets X1 and X2, which
involve E. coli genes not covered and covered, respectively, by RBH for
the current target genome. Y, Y1 and Y2 are similarly defined. (b) An
example in X1 and (c) an example in X2. Each block arrow represents
a gene, and each rectangular box represents an operon. A directed edge
represents a pair of homologous genes, and a directed edge pointing
both ways corresponds to a bidirectional best hit between two genes.
The gene pair in red is identified by GOST, the one in yellow is
supporting information for the red gene pair and the gene in green is the
ortholog predicted by RBH.

Figure 3c shows an example of X2, in which GOST 
predicted (g, t) as an orthologous pair and RBH predicted 
(g, m) as the orthologous pair for the same pair of 
genomes. We noted that GOST's prediction is supported
by a second homologous gene pair (g', t') sharing the two 
operons. Overall, among the 24 798 orthologous pairs in 
X2, 7800 (31.5%) pairs have been found to have support- 
ing evidence similar to the above. In contrast, only 182 
(0.7%) such cases in Y2 have been found to have this type 
of supporting evidence. 



GOST can resolve complicated orthology-mapping 
problems involving gene fusions 

Separate genes in one organism can be fused into a single
gene in another organism, which is known as 'gene fusion'
(19). Gene fusion can make a sequence similarity-based
orthology mapping program fail. Figure 4a shows
an example highlighting the challenge faced by programs
like RBH. GI:16128036 (shown in red) is an E. coli gene
and its orthologous gene is fused with another gene
(whose E. coli ortholog is GI:16129654) in Slackia
heliotrinireducens DSM 20476, forming gene
GI:257063013. The best BLAST hit of E. coli
GI:16128036 is GI:257063013 in S. heliotrinireducens,
while the best BLAST hit of GI:257063013 in E. coli is
GI:16129654 rather than GI:16128036; hence RBH failed
to call this orthologous gene pair while GOST is able to
make the correct call. Overall, out of the 125 831 mapped
orthologous genes in X1 (Figure 3), 13 841 (11%) encode
multi-domain proteins based on searches against the
Conserved Domains Database (20).

In addition, 5896 (20.4%) mapped genes out of the
24 798 X2 genes encode multi-domain proteins, again
highlighting the general issue faced by RBH. We
provide one such example in Figure 4b (see the
corresponding gene tree in Supplementary Figure S5 in the
Supplementary Data; more examples can be found
in Supplementary Table S3). In this case, the ortholog of
E. coli gene GI:145698337 is GI:220915733, formed by a
fusion of the orthologs of E. coli genes GI:145698337 and
GI:16130722, in the Anaeromyxobacter dehalogenans 2CP-1
genome. RBH incorrectly predicted GI:220915313 to be the
ortholog of GI:145698337 as they are the reciprocal best
hits of each other. However, GI:145698337 hits
GI:220915733 and GI:220915313 with the same BLAST
P-value of 2e-19 (see details in Supplementary Table S4),
and the orthologous pair (GI:16131796, GI:220915734)
provides additional support for the GOST call
(GI:145698337, GI:220915733) being the correct
ortholog pair. Hence the real ortholog may not always
have the highest sequence similarity to the query,
especially when several homologs have similar BLAST
P-values. Overall, GOST can overcome the general issue
caused by gene fusions, while sequence similarity-based
orthology mapping programs such as RBH have intrinsic
difficulties.

GOST is capable of identifying orthologs in the presence of
horizontal gene transfers

Horizontally transferred genes (HTGs) (21) could affect 
orthologous gene mapping results because genes acquired 
by horizontal gene transfers may show higher sequence 
similarities to homologous genes in the donor or closely 
related organisms than the actual orthologs in pair-wise 
genome comparisons. While it is difficult to derive
detailed statistics of the impact of HGTs on sequence
similarity-based orthology mapping programs due to the
lack of a large set of HTGs, we provide the following case
studies to illustrate why GOST fares better than RBH in
the presence of HTGs.
Figure 4. An illustration of the impact of gene fusion on ortholog identification. (a) and (b) are two real examples corresponding to Figure 3(b) and (c),
respectively. (a) The block arrow consisting of a red and a blue arrow represents a candidate ortholog of the red gene in E. coli recognized by GOST;
and the two yellow arrows represent a pair of working partners which RBH failed to call a candidate ortholog. (b) The red-blue mixed (respectively
green) block arrow represents a candidate ortholog of the red gene in E. coli recognized by GOST (respectively RBH) and the two yellow arrows represent a
pair of working partners. The meanings of directed edges are the same as those in Figure 3.



Keeling et al. (22) recently classified HTGs into six 
categories, namely duplicative transfer, recent homolo- 
gous replacement, ancient homologous replacement, 
duplicative transfer with differential loss, sequential 
transfer and transfer of new gene. These six categories 
largely represent two types of gene transfers: 'duplicative 
transfers' and 'orthologous replacement transfers'. The 
former may cause incorrect calls of orthologs, while the 
latter generally will not as the original copy is lost. 
Supplementary Table S5 summarizes the orthologous 
gene mapping performance by GOST and RBH on five 
examples that involve HTGs, covering different scenarios. 
We highlight one case and refer the reader to 
Supplementary Figure S6 for the others. 

Consider orthology mapping from E. coli K12 to
Halothermothrix orenii H 168, an organism evolutionarily
distant from E. coli. RBH mapped GI:16129840 of
E. coli to GI:220933093 of H. orenii since they are
reciprocal best BLAST hits, while GOST mapped the gene
to GI:220931604. Comparing the species tree and the gene
tree in Figure 5a, we note that the locations of
GI:220931604 are consistent between the two trees while
the locations of GI:220933093 are clearly not, implying
that GI:220933093 is a recent HTG from an organism
close to E. coli.

Again, the key reason that GOST is generally not 
affected by HTGs is that it relies on both sequence simi- 
larity and contextual information to derive orthology re- 
lationships. This again highlights that a real ortholog does 
not always have the highest sequence similarity to the 
query. 



CONCLUDING REMARK 

Sequence similarity-based methods remain the dominant
technique for large-scale orthologous gene mapping
because of their computational efficiency and generally
acceptable prediction accuracy, but sequence similarity
alone cannot guarantee an orthologous relationship, either
theoretically or practically. In this article, we presented a
novel gene mapping procedure that integrates contextual
and sequence similarity information. The combination of
these two types of information clearly makes our
program more reliable, with higher coverage, in complex
situations as shown above. This new tool provides an
orthology mapping capability at an accuracy level
comparable to those of phylogeny-based approaches and yet
efficient enough for large-scale applications.

Figure 5. (a) A species tree containing E. coli (red) and H. orenii H 168 (green); the corresponding gene tree contains the gene GI:16129840 (in
red) from E. coli and its two predicted orthologous genes GI:220931604 and GI:220933093 (in green) from H. orenii by GOST and RBH, respectively.
(b) An illustration of orthologous gene mappings for both RBH and GOST, where the red (respectively green) block arrow in H. orenii represents the
candidate ortholog 220931604 (respectively 220933093) of 16129840 recognized by GOST (respectively RBH) and the two yellow arrows represent a
pair of working partners.